Objectives:
Create and interpret multiple linear regression models
Create 3-D scatterplots
Learn how to create partial regression plots
Create and interpret additive models
Create and interpret models with interactions
Read in the Bodyfat.csv data set and store the result in bf.
bf <- read.csv("Bodyfat.csv") %>%
clean_names()
knitr::kable(head(bf))
| density | pct_bf | age | weight | height | neck | chest | abdomen | waist | hip | thigh | knee | ankle | bicep | forearm | wrist |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1.0708 | 12.3 | 23 | 154.25 | 67.75 | 36.2 | 93.1 | 85.2 | 33.54331 | 94.5 | 59.0 | 37.3 | 21.9 | 32.0 | 27.4 | 17.1 |
| 1.0853 | 6.1 | 22 | 173.25 | 72.25 | 38.5 | 93.6 | 83.0 | 32.67717 | 98.7 | 58.7 | 37.3 | 23.4 | 30.5 | 28.9 | 18.2 |
| 1.0414 | 25.3 | 22 | 154.00 | 66.25 | 34.0 | 95.8 | 87.9 | 34.60630 | 99.2 | 59.6 | 38.9 | 24.0 | 28.8 | 25.2 | 16.6 |
| 1.0751 | 10.4 | 26 | 184.75 | 72.25 | 37.4 | 101.8 | 86.4 | 34.01575 | 101.2 | 60.1 | 37.3 | 22.8 | 32.4 | 29.4 | 18.2 |
| 1.0340 | 28.7 | 24 | 184.25 | 71.25 | 34.4 | 97.3 | 100.0 | 39.37008 | 101.9 | 63.2 | 42.2 | 24.0 | 32.2 | 27.7 | 17.7 |
| 1.0502 | 20.9 | 24 | 210.25 | 74.75 | 39.0 | 104.5 | 94.4 | 37.16535 | 107.8 | 66.0 | 42.0 | 25.6 | 35.7 | 30.6 | 18.8 |
Reproduce Figure 9.1 from page 282 of your text book.
# Your Code Goes HERE
Verify the least squares regression line provided on page 283.
# Your Code Goes HERE
The least squares regression line for regressing pct_bf onto waist is \[\widehat{\text{pct_bf}} = xxx + xxx \times \text{waist}\]
Find the least squares regression equation for regressing pct_bf onto waist and height. Store the result in mod_mr.
# Your Code Goes HERE
mod_mr <- lm(pct_bf ~ waist + height, data= bf)
The least squares regression equation (a plane in this case) for regressing pct_bf onto waist and height is \[\widehat{\text{pct_bf}} = xxx + xxx \times \text{waist} -xxx \times \text{height}\]
An alternative way to visualize a multiple regression model with two numeric explanatory variables is as a plane in three dimensions. This is possible in R using the plotly package.
There are three objects you will need:
x: a vector of unique values of waisty: a vector of unique values of heightplane: a matrix of the fitted values across all combinations of x and yMuch like ggplot(), the plot_ly() function will allow you to create a plot object with variables mapped to x, y, and z aesthetics. The add_markers() function is similar to geom_point() in that it allows you to add points to your 3D plot.
Note that plot_ly uses the pipe (%>%) operator to chain commands together.
library(plotly)
# draw the 3D scatterplot
p <- plot_ly(data = bf, z = ~pct_bf, x = ~waist, y = ~height, opacity = 0.6) %>%
add_markers()
p
Adding the plane to the 3D plot
summary(mod_mr)$coef
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.1008816 7.68610576 -0.4034399 6.869737e-01
waist 1.7730935 0.07158344 24.7696056 6.794831e-69
height -0.6015392 0.10993883 -5.4715806 1.093093e-07
x <- seq(25, 50, length = 70)
y <- seq(60, 80, length = 70)
plane <- outer(x, y, function(a, b){summary(mod_mr)$coef[1,1] +
summary(mod_mr)$coef[2,1]*a + summary(mod_mr)$coef[3,1]*b})
# draw the plane
p %>%
add_surface(x = ~x, y = ~y, z = ~plane, showscale = FALSE)
Read in the Real_Estate.csv data set and store the result in re.
re <- read.csv("Real_Estate.csv") %>%
clean_names()
names(re)
[1] "price" "living_area" "bedrooms" "bathrooms"
[5] "year" "garage" "date_collected" "location_type"
[9] "urban" "suburb" "rural"
# Create age variable
re <- re %>%
mutate(age = 2022 - year)
Find the least squares equation for regressing price onto living_area and bedrooms. Store the result of the the lm() call in mod_re.
# Your Code Goes HERE
The least squares regression equation for regressing price onto living_area and bedrooms is \[\widehat{\text{price}} = xxx + xxx \times \text{living_area} -xxx \times\text{bedrooms}\]
Explore the data!
library(GGally) # load GGally package
ggpairs(data = re,
columns = c("living_area", "age", "bedrooms", "price"),
aes(alpha = 0.01)) +
theme_bw()
Read in the data set Housing_prices.csv and store the results in hp.
hp <- read.csv("Housing_prices.csv") %>%
clean_names()
names(hp)
[1] "price" "living_area" "bathrooms" "bedrooms" "fireplaces"
[6] "lot_size" "age" "fireplace"
Explore the data!
library(GGally) # load GGally package
ggpairs(data = hp,
columns = c("living_area", "age", "bedrooms", "price"),
aes(alpha = 0.001)) +
theme_bw()
Note: the scatterplot of price versus age is not linear! To straighten the relationship, the book takes the log10(age + 1).
hp <- hp %>%
mutate(log_age = log10(age + 1))
ggpairs(data = hp,
columns = c("living_area", "log_age", "bedrooms", "price"),
aes(alpha = 0.001)) +
theme_bw()
Find the least squares regression equation for regressing price onto living_area, log_age, and bedrooms. Store the result of the lm() call in mod_hp.
mod_hp <- lm(price ~ living_area + log_age + bedrooms, data = hp)
summary(mod_hp)
Call:
lm(formula = price ~ living_area + log_age + bedrooms, data = hp)
Residuals:
Min 1Q Median 3Q Max
-244477 -25752 -3526 16086 398126
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 44797.165 8356.609 5.361 1.02e-07 ***
living_area 87.260 3.365 25.928 < 2e-16 ***
log_age -14439.082 2991.365 -4.827 1.59e-06 ***
bedrooms -5902.756 2773.934 -2.128 0.0336 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 49620 on 1053 degrees of freedom
Multiple R-squared: 0.5876, Adjusted R-squared: 0.5864
F-statistic: 500.1 on 3 and 1053 DF, p-value: < 2.2e-16
coef(mod_hp)
(Intercept) living_area log_age bedrooms
44797.1647 87.2598 -14439.0815 -5902.7556
[] notation means “all but”). Thus the subscript says “residuals after regression on all but \(x_1\).”Create a partial regression plot for the coefficient of height in the multiple regression model mod_mr.
# Recall mod_mr
mod_mr <- lm(pct_bf ~ waist + height, data= bf)
summary(mod_mr)
Call:
lm(formula = pct_bf ~ waist + height, data = bf)
Residuals:
Min 1Q Median 3Q Max
-11.1692 -3.4133 -0.0977 3.0995 9.9082
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.10088 7.68611 -0.403 0.687
waist 1.77309 0.07158 24.770 < 2e-16 ***
height -0.60154 0.10994 -5.472 1.09e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.46 on 247 degrees of freedom
Multiple R-squared: 0.7132, Adjusted R-squared: 0.7109
F-statistic: 307.1 on 2 and 247 DF, p-value: < 2.2e-16
my.2 <- lm(pct_bf ~ waist, data = bf)
ey.2 <- residuals(my.2)
m2.2 <- lm(height ~ waist, data = bf)
e2.2 <- residuals(m2.2)
df <- data.frame(ey.2 = ey.2, e2.2 = e2.2)
ggplot(data = df, aes(y = ey.2, x = e2.2)) +
geom_point(color = "blue") +
geom_smooth(method = "lm", se = FALSE) +
theme_bw()
coefficients(lm(ey.2 ~ e2.2))
(Intercept) e2.2
6.758092e-17 -6.015392e-01
Using the effects package functions to create a similar plot.
library(effects)
e2 <- predictorEffect("height", mod_mr, residuals = TRUE)
plot(e2)
Create a partial regression plot for the variable waist in the multiple regression model mod_mr.
# Your Code Goes HERE
Use the effects package to create a similar plot.
# Your Code Goes HERE
Note that the preditorEffect() changes to preditorEffects() when not specifying the variable for a single partial plot. See the vignettes from package effects for more details on the available arguments for the predictorEffect() and predictorEffects() functions.
plot(predictorEffects(mod_mr, residuals = TRUE))
Read in the Coasters_2015.csv data set and store the results in coasters.
coasters <- read.csv("Coasters_2015.csv") %>%
clean_names()
names(coasters)
[1] "name" "park" "track" "speed" "height"
[6] "drop" "length" "duration" "inversions"
Create a subset of coasters named sub where only rows where neither the duration nor the drop is missing. Also remove the Tower of Terror and Xcelerator coasters as Tower of Terror has been discontinued and the Xcelerator uses a different method of acceleration so its largest drop is not the source of speed.
sub <- coasters %>%
filter(!is.na(duration) & !is.na(drop)) %>%
filter(name != "Tower of Terror" & name != "Xcelerator")
knitr::kable(head(sub))
| name | park | track | speed | height | drop | length | duration | inversions |
|---|---|---|---|---|---|---|---|---|
| Millennium Force | Cedar Point | Steel | 93 | 310 | 300 | 6595 | 165 | 0 |
| Goliath | Six Flags Magic Mountain | Steel | 85 | 235 | 255 | 4500 | 180 | 0 |
| Titan | Six Flags Over Texas | Steel | 85 | 245 | 255 | 5312 | 210 | 0 |
| Desperado | Buffalo Bill’s Resort & Casino | Steel | 80 | 209 | 225 | 5843 | 163 | 0 |
| Nitro | Six Flags Great Adventure | Steel | 80 | 230 | 215 | 5394 | 240 | 0 |
| Superman - Ride Of Steel | Six Flags New England | Steel | 77 | 208 | 221 | 5400 | 155 | 0 |
Currently, inversions is a numerical variable with two values (0 and 1). Make inversions a categorical variable with values No and Yes corresponding to the 0s and 1s.
sub <- sub %>%
mutate(inversions = ifelse(inversions == 0, "No", "Yes"))
knitr::kable(head(sub))
| name | park | track | speed | height | drop | length | duration | inversions |
|---|---|---|---|---|---|---|---|---|
| Millennium Force | Cedar Point | Steel | 93 | 310 | 300 | 6595 | 165 | No |
| Goliath | Six Flags Magic Mountain | Steel | 85 | 235 | 255 | 4500 | 180 | No |
| Titan | Six Flags Over Texas | Steel | 85 | 245 | 255 | 5312 | 210 | No |
| Desperado | Buffalo Bill’s Resort & Casino | Steel | 80 | 209 | 225 | 5843 | 163 | No |
| Nitro | Six Flags Great Adventure | Steel | 80 | 230 | 215 | 5394 | 240 | No |
| Superman - Ride Of Steel | Six Flags New England | Steel | 77 | 208 | 221 | 5400 | 155 | No |
Create a scatterplot of duration versus drop color coded by inversions using the data in sub.
# Your Code Goes HERE
ggplot(data = sub, aes(x = drop, y = duration, color = inversions)) +
geom_point() +
theme_bw() +
geom_smooth(method = "lm", se = FALSE) +
labs(color = "Inversion Legend")
Consider the regression model which graphs the lines in the previous plot:
mod_iv <- lm(duration ~ drop + inversions + drop:inversions, data = sub)
summary(mod_iv)
Call:
lm(formula = duration ~ drop + inversions + drop:inversions,
data = sub)
Residuals:
Min 1Q Median 3Q Max
-56.653 -23.107 3.363 19.856 87.806
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 98.63790 12.71503 7.758 1.74e-11 ***
drop 0.35803 0.07402 4.837 5.81e-06 ***
inversionsYes -9.43721 22.19237 -0.425 0.672
drop:inversionsYes -0.02214 0.15922 -0.139 0.890
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 33.11 on 85 degrees of freedom
Multiple R-squared: 0.3257, Adjusted R-squared: 0.3019
F-statistic: 13.68 on 3 and 85 DF, p-value: 2.311e-07
coefficients(mod_iv)
(Intercept) drop inversionsYes drop:inversionsYes
98.63789581 0.35802501 -9.43721022 -0.02213819
# Intercept for Inversions
coefficients(mod_iv)[1] + coefficients(mod_iv)[3]
(Intercept)
89.20069
# Slope for Inversions
coefficients(mod_iv)[2] + coefficients(mod_iv)[4]
drop
0.3358868
# Intercept for No Inversions
coefficients(mod_iv)[1]
(Intercept)
98.6379
# SLope for No Inversions
coefficients(mod_iv)[2]
drop
0.358025
Consider a simpler model that has the same slope and a different intercept for the coasters that have inversions and those that do not have inversions. This model is called a parallel slopes model.
mod_ps <- lm(duration ~ drop + inversions, data = sub)
summary(mod_ps)
Call:
lm(formula = duration ~ drop + inversions, data = sub)
Residuals:
Min 1Q Median 3Q Max
-57.569 -22.868 2.804 19.521 87.740
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 99.39374 11.42858 8.697 2.04e-13 ***
drop 0.35324 0.06516 5.421 5.34e-07 ***
inversionsYes -12.34811 7.31876 -1.687 0.0952 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 32.92 on 86 degrees of freedom
Multiple R-squared: 0.3255, Adjusted R-squared: 0.3098
F-statistic: 20.75 on 2 and 86 DF, p-value: 4.422e-08
To graph a parallel slopes model with ggplot() we will use the geom_parallel_slopes() function from the moderndive package.
library(moderndive)
ggplot(data = sub, aes(x = drop, y = duration, color = inversions)) +
geom_point() +
geom_parallel_slopes(se = FALSE) +
theme_bw()
Use mod_ps to predict the duration of Hangman and Hayabusa. Note that there are typos in the book for the prediction of Hayabusa.
sub %>%
filter(name == "Hangman" | name == "Hayabusa") %>%
knitr::kable()
| name | park | track | speed | height | drop | length | duration | inversions |
|---|---|---|---|---|---|---|---|---|
| Hangman | Wild Adventures | Steel | 55.0 | 115.0 | 95.00 | 2170.0 | 125 | Yes |
| Hayabusa | tokyo SummerLand | Steel | 60.3 | 137.8 | 124.67 | 2559.1 | 108 | No |
| Hayabusa | Tokyo SummerLand | Steel | 60.3 | 137.8 | 124.67 | 2559.1 | 108 | No |
hayabusaD <- predict(mod_ps, newdata = data.frame(drop = 124.67, inversions = "No"))
hayabusaD
1
143.4323
hangmanD <- predict(mod_ps, newdata = data.frame(drop = 95, inversions = "Yes"))
hangmanD
1
120.6035